
the reference genome in a process known as read sequence mapping or alignment. In the

read mapping process, the FASTQ files may contain millions of read sequences that we

wish to align to a sequence of a reference genome to produce aligned reads in a file format

called SAM, which stands for Sequence Alignment/Map format. The aligned reads can also

be stored in the binary form of SAM, called BAM (Binary Alignment/Map format). We will

discuss this file format later in some detail.
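
As a rough preview of how the two formats relate, the sketch below writes a tiny hand-made SAM file and converts it to BAM with “samtools view”. The file names toy.sam and toy.bam and the single read are made up for illustration, and the conversion runs only if Samtools is already installed.

```shell
# Hypothetical toy SAM file: two header lines and one aligned read (tab-separated)
printf '@HD\tVN:1.6\tSO:unsorted\n@SQ\tSN:chrA\tLN:16\n' > toy.sam
printf 'read1\t0\tchrA\t1\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' >> toy.sam

# Convert the text SAM file to binary BAM, but only when samtools is available
if command -v samtools >/dev/null 2>&1; then
    samtools view -b -o toy.bam toy.sam
    # Print the BAM content back as text, including the header (-h)
    samtools view -h toy.bam
fi
```

The BAM file holds the same records as the SAM file in a compressed binary form, so it is usually much smaller for real data.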

In general, sequence mapping or alignment requires three elements: a reference file in

the FASTA format, short-sequence reads in FASTQ files, and an aligner, which is a program

that uses an algorithm to align reads to a reference genome sequence. We have already

discussed how to download the sequence of a reference genome of an organism from the

NCBI Genome database. However, before a reference genome is used with an aligner, it

may need to be indexed with the “samtools faidx” command. You can download and install

Samtools by following the instructions available at “http://www.htslib.org/download/”. On

Ubuntu, you can install it using the following command:

sudo apt-get install samtools

Once you have installed Samtools successfully, you can use it to index the reference

genome and to perform other tasks that you will learn later.
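
As a small illustration of the indexing step, the sketch below builds a toy two-sequence FASTA file and indexes it with “samtools faidx”. The file name toy_ref.fna and its sequences are hypothetical; the same command should apply to a real reference file. The indexing runs only if Samtools is found on the system.

```shell
# Hypothetical toy reference with two sequences (chrA spans two lines)
printf '>chrA\nACGTACGTAC\nGTACGT\n>chrB\nTTGGCCAA\n' > toy_ref.fna

# Index the FASTA file; this writes toy_ref.fna.fai next to it
if command -v samtools >/dev/null 2>&1; then
    samtools faidx toy_ref.fna
    # Show the index that was created
    cat toy_ref.fna.fai
fi
```

The resulting .fai file is a tab-separated index with one line per sequence, giving its name, length, byte offset, bases per line, and bytes per line.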

You have already downloaded the human reference genome above. If you didn’t do that,

you can download and decompress it using the following commands:

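# Create a directory for the reference genome
mkdir -p refgenome
# Download the compressed GRCh38.p13 FASTA file from NCBI
wget \
-O "refgenome/GRCh38.p13_ref.fna.gz" \
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz
# Decompress the file inside the refgenome directory
cd refgenome
gunzip -d GRCh38.p13_ref.fna.gz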
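
After decompression, a quick sanity check can confirm that the file looks like a FASTA file. The sketch below is one possible check, not a required step: the helper name count_fasta_records is hypothetical, and the check is skipped if the reference file is not in the current directory.

```shell
# Count sequence records in a FASTA file (each record starts with a '>' header line)
count_fasta_records() {
    grep -c '^>' "$1"
}

REF="GRCh38.p13_ref.fna"
if [ -f "$REF" ]; then
    # Show the first header line, then the number of sequences in the file
    head -n 1 "$REF"
    count_fasta_records "$REF"
fi
```

A count of zero, or a first line that does not start with “>”, would suggest that the download or decompression did not complete correctly.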

FIGURE 2.3  Part of the human annotation file in GTF file format.